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Abstract 

In order to use Bayesian methods to quantify uncertainty through the 
posterior, efficient problem dependent sampling algorithms are necessary. 
Motivated by an elliptic inverse problem with non-Gaussian prior, we 
study how to design proposal chains for the Metropolis-Hastings algorithm 
that scale well as the dimensions increases. Whereas there are results on 
dimension independent MCMC for Gaussian priors, we provide a simple 
recipe to achieve this for non- Gaussian priors. For the elliptic inverse 
problem we explicitly construct an efficient locally moving proposal chain 
and provide additional numerical evidence. 

More generally, we consider efficient proposals for Metropolis-Hastings 
algorithms for non-Gaussian target measures on functions spaces. Our 
main result is: 

Given a proposal Markov kernel which is reversible and has an L 2 - 
spectral gap with respect to a reference measure, then the Metropolis- 
Hastings algorithm applied to a measure with a bounded density has an 
L 2 -spectral gap, too. 

1 Introduction 

Due to the success of the applications of nonparametric Bayesian statistics [H| 
and Bayesian inverse problems [3T] to toy and and real-world problems jl] , there 
has been a growing demand for sampling techniques of high dimensional prob- 
ability distributions. In both disciplines the aim is to approximate expectations 
with respect to a probability distribution conditioned on data. These methods 
match the input and the parameter to the data while quantifying uncertainty 
through the posterior spread. The underlying mathematical models are often 
based on PDEs which have to be approximated for computations. These results 
are a trade-off between the sampling and the model-approximation error [15 . 
A finer discretization of a PDE leads to a smaller model-approximation error 
but also to higher dimensional state spaces. For many problems the sampling 
algorithms, which can be extended to the infinite dimensional state space of the 
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PDE, outperform the non-extendable methods. This insight was first properly 
stated by Stuart, Wiberg and Voss in (32]. Gaussian processes and priors have 
had a great impact in many areas |34| |9] . In particular since there is no ana- 
logue of the Lebesgue measure in infinite dimensions, Gaussian measures have 
often been used as building block. We refer to [5] for an overview over sampling 
methods which are well-defined in this case and based on Metropolis-Hastings 
Markov Chain Monte Carlo (MCMC [21 H3]) algorithms. 

We assume that the target measure ir is a Borel measure on the Banach 
space X. It is given by 

TT CX /7T 0) (1) 

where tto is a reference Borel probability measure on X. 

The proposal kernels for the efficient algorithms [5 are all reversible and 
have a spectral gap if the target measure is a Gaussian reference measure. 

For the preconditioned Crank-Nicolson algorithm, we have recently shown 
that this spectral gap transfers to the Metropolis-Hastings kernel as well under 
mild conditions [TT]. However, having a density with respect to a Gaussian 
measure is not a natural or possible assumption for all applications |291 121j . In 
image processing, Bayesian methods are used to recover images and Gaussian 
priors tend to remove the edges which are supposed to be recovered |3| . For the 
particular elliptic problem under consideration, non-Gaussian priors are desired 
by the scientists |12j . How uncertainty propagates from the input to the output 
of a model can be studied by considering realisations of the diffusion coefficient 
[1 . In fact it can be more efficiently to use generalised polynomial chaos [21] 
instead of Monte Carlo methods. This problem has first been analysed from the 
Bayesian perspective in [29 . 

The main analytic results in this article is that we show that the spectral 
gap is transferred from the proposal to the Metropolis-Hastings Markov kernel 
if the density is bounded below and above. This assumption is much stronger 
than the assumption for the pCN algorithm in the Gaussian case [TT] but it 
applies to all reference measures and proposal kernels. 

So far theoretical results, guaranteeing a spectral gap, were only available 
for the independence sampler, so called because only the acceptance probability 
depends on the current state but not the proposal. 

Moreover, we construct a class of MCMC methods performing local steps and 
having the same asymptotic complexity as the independence sampler considered 
in |29| . Numerically, the new methods yield a substantial improvement. 

1.1 Bayesian Inverse Problems 

The need for sampling algorithms, which behave well when the dimension in- 
creases, can be motivated by Bayesian inverse problems. Let a € X be the input 
of a mathematical model which is an initial condition for an ODE, a PDE or 
a discrete scheme and let y = Q (a) be the observable quantity where X and Y 
are Banach spaces and y £ Y denotes the data. The inverse problem is con- 
cerned with reconstructing a given the data y. Because Q can be non-injective 
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and there is always an error in the data, the problem is not exactly solvable as 
stated. It can be tackled by modelling the observational noise £ such that the 
data is generated by 



V = Q(a)+t 



The prior po(da) is a probability measure containing all the a priori inform- 
ation on a. If in addition the forward operator Q is given and the distributions 
of £ and Q satisfy mild assumptions, then there exists a conditional probability 
measure on a, called the posterior [!JT]. It contains all the information of the 
data about a. It takes the form 



However, this measure is only given as an unormalized density with respect 
to the prior. MCMC algorithms can now be used to approximate its mean or 
variance or any expectation with respect to the posterior . The trick is that one 
only needs the ratio of / at x and y to calculate the acceptance probability (see 
Section 1.2). 

In this case it is natural to take the prior as reference measure and the pos- 
terior as target measure. In order to implement the procedure on a computer, 
we have to discretize and approximate Q . Sampling methods that only apply to 
the discretize version often deteriorate as the dimension of the approximation in- 
creases. We have made this rigorous for tto being Gaussian [TT|. However, there 
are non-Gaussian reference measures for which only simple sampling methods 
have been investigated [T5] . For this reason we review the construction of the 
Metropolis-Hastings algorithm on general state spaces. 

1.2 Metropolis-Hastings on General State Spaces 

The common characteristic of MCMC algorithms is to create a Markov chain 
which preserves some prescribed target measure one wishes to sample from. 
Samples of this Markov chain satisfy a law of large numbers under mild con- 
ditions or a central limit theorem (CLT) under stronger conditions. The basic 
idea of Metropolis-Hastings algorithms is to alter a proposal Markov chain by 
accepting or rejecting a move so that the samples of the resulting chain satisfy 
a law of large numbers for the target distribution. 

We follow [33J to formulate the Metropolis-Hastings algorithm on the general 
state space E for target measure it. In The idea of the Metropolis-Hastings 
kernel is to add an accept-reject mechanism to a proposal Markov kernel Q(x, dy) 
to produce a Markov kernel P(x,dy) preserving the target measure. Hence 



Prior a ~ fiQ 

Noise £ with pdf p(rf) 

Likelihood y\ar.v. with pdf p(y — S{a)) 

Posterior cx p(y — Q(a)). 



(2) 
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The Metropolis-Hastings algorithm accepts a proposed move from x to y is 
accepted with acceptance probability a(x,y), thus P(x,dy) can be written as 

P(x,dy) = a(x,y)Q(x,dy) +S x (dy) ( 1 - / Q(x, dy)a(x, y) ) . 



If the Radon-Nikodym derivative ^ffiw^'^l exists, then choosing 



dTt(dx)Q{x,dy) 

dir(dy)Q(y,dx) 



^ \ ' diT(dx)Q(x, dy) 

makes P reversible with respect to tt |33| . If there is a common dominating 
measure /j, such that 

tt = fn,Q(x,dy) = q(x,y)n(dy), 

then 

«(*,») =##4 ( 3 ) 

f{ymx,y) 

Reversibility is a crucial property of the Metropolis-Hastings algorithm be- 
cause in combination with L 2 - or Wasserstein spectral gaps, this yields import- 
ant properties of the sample average (see Section |2| : 

Definition 1. A Markov kernel P is reversible with respect to a measure tto if 
ir (dx)Q(x,dy) = n (dy)Q(y,dx). 
If Q is reversible, then Q reduces is to 

In case of a Bayesian inverse problem the target corresponds to the posterior. 
It is natural to take ttq to be the prior. In this case / corresponds to the 
likelihood and the acceptance probability only depends on the quotient of the 
likelihood. 

The Metropolis-Hastings algorithm can be summarised as 

Algorithm 2. Initialise Xq. For i=0,. . . ,n do: 
Generate Y ~ Q{Xi 1 ■) and set 



Y with probability a(Xi,Y) 
Xi otherwise 



A,-|_i — 

The algorithm is run in order to use {Xi}" =1 to approximate J n(dx)f(x) 

n a +n 

S n ,n (f) = " E ft*)' 



n 

i=n a 
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The complexity of such an algorithm can be quantified as 



number of steps necesary x cost of one step. 

The cost of one step is usually easy to quantify for the problem at hand. If Xi 
were i.i.d. samples, the central limit theorem yields that the error is of order 
O(^). Much theoretical work is aimed at proving that this is still the case for 

the samples of an algorithm or even bounding the leading constant in 0(-^=). 
The methods to achieve this are related to different notions of convergence of 
the process to equilibrium. These can broadly be classified as follows |?7l I23| : 

1. For a metric on the space of measures such as the total variation or the 
Wasserstein metric the rate of convergence to equilibrium can be charac- 
terised through the decay of d(yV n , /i) where v is the initial distribution 
of the Markov chain. 

2. For the Markov operator V the convergence rate is given as the operator 
norm of V on a space of functions from X to R modulo constants. The 
most prominent example here is the L 2 -spectral gap. 

3. Direct methods like regeneration and the so-called split-chain which use 
the dynamics of the algorithm to introduce independence. The independ- 
ence can be used to prove a central limit theorem. 

The regeneration and total variation methods have been very successful in ob- 
taining rates for finite dimensional problems [26J. However, the necessity of 
the algorithm to be irreducible often fails for the infinite dimensional problem 
and thus often leads to detonation of the theoretical bounds as the dimension 
increases. One of the exceptions to this is the independence sampler [T5]. Ex- 
cept for E7] . the L 2 -spectral gap has so far mostly been used for finite 
or countable state Markov chains. For them it can be bounded in terms of 
quantities associated with its graph [Jj]. This idea has also been applied to the 
Metropolis- Algorithm in [30] and (8] . 



1.3 Outline 

In Section [2] we introduce spectral gaps and the consequences for the sample 
average before proving the main theorem. In Section [3] we construct a sample 
satisfying the condition of our main theorem for an elliptic inverse problem. For 
this example we present numerical results in Section ^] 

Acknowledgement. The author would like to thank Professor Andrew Stuart 
and Professor Martin Hairer for helpful discussions and proof reading of the 
paper. SJV is supported by ERC. 
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2 L 2 -Spectral Gaps from the Proposal to the Metropolis- 
Hastings Kernel 

Our main theorem will allow us to prove a spectral gap for the Metropolis- 
Hastings algorithm if the proposal kernel is reversible and has an L 2 -spectral 
gap with respect to a reference measure. 

If Q is reversible with respect to ttq, then a takes a particular simple form: 

Lemma 3. Suppose Q is reversible with respect ir , then 

a(x,y) = mm l, 

Proof. A short calculation yields 

dir(dy)Q(y,dx) df(y)TT Q (dy)Q(y,dx) 
dir(dx)Q(x,dy) df(x)wo(dx)Q(x,dy) 
_ df(y)TT (dx)Q(x,dy) 
df(x)TT {dx)Q(x,dy) 

m 

Wr 

□ 

2.1 L 2 -Spectral Gap 

We recall the definition of an L 2 — spectral gap first: 

Definition 4. (L 2 -spectral gap) A Markov operator V with invariant measure 
\i has an i^-spectral gap 1 — /3 if 

R II-DII PV-M/)ll2 ^ 

^ =M ^ = ;^i7^m <L 

The spectral gap implies favourable properties of the sample average such 
as bounds on the asymptotic variance of the CLT as well as a non-asymptotic 
bound on the mean square error (MSE). This reason together with the fact 
that geometric ergodicity implies an L 2 -spectral gap for reversible kernels [27 
justifies the consideration of spectral gaps as the rate of convergence. 

The following result of jH] (see also |T7] whence the statement was adapted) 
then yields a CLT: 

Proposition 5. Consider an ergodic Markov chain with transition operator P 
which is reversible with respect to a probability measure ir and has an L^-spectral 
gap 1 — p. Let f G L 2 be such that 



7 f.,P 



l + P 
1 - P 



G 



Then for Xq ~ /i the expression y/n(S„ — converges weakly to Af(0, a 2 P ). 

Moreover, the following inequality holds 

4 iP <2 M ((/ 2 - M (/)))/(l-/3)<^. 

We can now use the subsequent result on the mean square error in order to 
quantify the complexity |27| : 

Proposition 6. Suppose that we have a Markov chain with Markov operator V 
which has an L 2 -spectral gap 1 — j3. For p £ (2, oo] let no(p) 



Tlo(p) > 



2(p-2) 



32p x 
p-2 i 



du 



log(/3- 1 ) I io g (64) 



du 
dft 



P£ (2,4) 
P e [4,oo]. 



The 



sup E 
l/ll <i 



Sn.no 0?))' 



< 



n(l-/3) n 2 (l-/3) 2 



(4) 



2.2 Main Result 

We can now use results from [TSl I5U] to transfer the spectral gap from Q to P: 

Theorem 7. Suppose that the proposal kernel Q has an L 2 — spectral gap 1 — 
fiprop with respect to its invariant measure /io and that the target measure takes 
the form 

where /* := inf / < / < sup / =: /*. Then the spectral gap 1 — P of V satisfies 



(1 - /W 2 /8 < 1 - P < 2 



Proof. Using the notion of conductance 



C = inf 

n(A)<± 



j A V{x : A c )d^{x) 



(5) 



we obtain an upper bound on the spectral gap by Cheeger's inequality [181 130| 

C 2 



< 1 - < 2C. 



(6) 



This yields 



r A P(x,A c )d7r(i) > A/ A P(x,A c )^ (x) 



/* 



7T0(A) 



> 



> 



f*\ 2 f A Q(x,A c )d7r (x) 



MA) 
(1 - /W/2 
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where we used ([6]) in the last step. Taking the infimum and using ([6]) gives 

(£) 4 (l-/W 2 /8<l-/3. 
The other inequality can be derived similarly. □ 

Remark 8. This lower bound is best in the case of the independence sampler 
because the proposal samples are i.i.d. corresponding to 1 — f3 pr0 p = 1> 

3 Applications to an Elliptic Inverse Problem 

This investigation was motivated by studying an elliptic inverse problem from 
the Bayesian perspective. The determination of the diffusion coefficient a given 
noisy observations of the pressure p is a well-known inverse problem |H]. It 
was first considered from a Bayesian perspective in |29_ using a generalised 
polynomial chaos approach. This problem has been studied in various settings 
|201 [T51 RM1 129"] . In particular it is relevant in oil reservoir simulations and the 
modelling of groundwater flow |19| . In the following we consider it in an idealised 
setting. 

It is assumed that the relation between p and a is given by the following 
elliptic PDE with Dirichlet boundary conditions 

- V • (aVp) = f(x) in D, p = on dD (7) 

where D is a bounded smooth domain in M. d . We assume that / is smooth and 
satisfies / > If > and denote the solution operator to ^ by p(x; a) . 

The inverse problem is concerned with the reconstruction of a given obser- 
vations based on p: 

Assumption 9. The forward operator can be split into a composition of the 
solution operator p and an observation operator O 

g(a) = 0(p(-;a)). (8) 

1. The operator O : Y — > M. m satisfies 

\\G(a)\\ r < K no a.e. 

2. We have a uniform lower bound on the diffusion coefficient a > a min > 
ir, 0. 

Because the construction of efficient samplers depends on the prior and the 
likelihood, we specify a particular class of priors which were considered in [29, 
I15| . The prior is given by a linear combination (over a finite or countable index 
set J) with i.i.d. coefficients 

/Jo = £( a i x )) where 
a[x) — a(x) + r )jajijjj{x) where dj 1A ^' U{—1, 1). (9) 



8 



Under Assumption [9] it is easy to see that the density of the posterior with 
respect to the prior is bounded 1 15] . 

Lemma 10. If the Assumption^ holds, then there is aC 6l such that 

c-i < ^ < a 

In order to use Theorem [7] we have to choose Q such that it is reversible 
and has an L 2 -spectral gap with respect to the prior. 

Since the spectral gap has a product property (see [10J, we just have to 
take any proposal kernel which has a spectral gap with respect to U[— 1, 1] and 
then apply this proposal to each component and scale them appropriately. It 
is a simple exercise to check that scaling the proposal as well as the invariant 
measure preserves the spectral gap: 

Lemma 11. Let AeK, T x (x) = Xx and P x (x, •) = T*P(f , ■)• Suppose P has 
an L 2 -spectral gap 1 — j3 with respect to it, then P\ has the same spectral gap 
with respect to T^/x. 

The one-dimensional proposal can be constructed through a Metropolis- 
Hastings kernel with a one dimensional proposal distribution. The resulting 
kernel Q has a spectral gap under mild assumptions [23 . 

Alternatively, we consider a random walk with uniformly (— e, e) distributed 
steps reflected at the boundaries —1 and 1. The resulting transition kernel has 
a density with respect to the Lebesgue measure 

Qe(x,dy) = q e (x,y)dX 

{1 -1 < x,y < 1, \x - y\ < e, y > -x - 2 + e,y < -x + 2 - e 
2 -1 < x, y < 1, y < — x — 2 + e or y > — x + 2 — e 
otherwise 

This transition kernel is reversible and has an L 2 -spectral gap because q e (x, y) = 
q e (y, x) and the transition kernel corresponds to a Harris chain [23]. We call the 
resulting MCMC algorithm Uniform Reflection Metropolis-Hastings (URMH) 
algorithm. 

Due to Theorem [7] we obtain the following Theorem: 

Theorem 12. Let Q be one of the Markov kernels described above and let As- 
sumption^hold true. Further, we take Qd = ®^ =1 Q(aj,da,j) . The Metropolis- 
Hastings transition kernel VdfPoo) associated with Qd(Qoo) has an I? -spectral 
gap and there is a dimension-independent lower bound on it. 

4 Numerical Results 

We provide numerical results for the inverse problem based on Q with D — 
[0, 1] as considered in Section [31 Our bounds on the spectral gap and leading 
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constant for the 0{^=) error bound of the sample average is worse for the 
URMH algorithm than for the independence sampler (cf. Remark [8|. However, 
numerical results in this section suggest that the URMH algorithm outperforms 
the independence sampler if / is peaked. 

The computational cost of both algorithms is nearly the same because the 
cost of computing the likelihood is more expensive than generating the pro- 
posal. Generating the proposal for the URMH algorithm is only marginally 
more expensive than for the independence sampler. In addition to the d uni- 
formly distributed numbers O(d) sum- and multiplication operations have to be 
performed. 

Since we consider D — [0, 1], there is an explicit formula for the solution 

P{x) = ~ f ^l^ds, with c=- f ^±ds/ f -±-ds 
Jo a(-V Jo a(-V Jo a («) 

which we have implemented using a trapezoidal rule. We suppose the data 
is of the form 

y = Q(a) + Z = ( P (iAt))\ 1 J Ati +Z 
where £ ~ Af(0, <r 2 I) and choose the prior as above (JsJ 

/io = £{ a ( x )) where 
a(x) — d{x) + ^^^jajipj(x) where dj '~ ' U{— 1, 1). 

For our simulations we choose 



a(x) — 4.34, J = 20. 

ip 2 j-i(x) = cos(2?rja;), j > 1, j 2 j = ^ ' •? - 1 

ip2j(x) = sin(27rja;), j > 0, 72^-1 = ~, 

ipo(x) = 1, 70 = 1. 

We run an identical twin experiment by fixing a* as a particular realisations of 
the prior and generate test data with it. We choose a and At very small such 
that the posterior /j, y is peaked around 

At = 0.02, a = 0.005. 

Subsequently, we compare the Random Walk Metropolis (RWM) , Independ- 
ence Sampler (IS) and the Uniform-Rejection-Metropolis-Hastings (URMH) al- 
gorithms by plotting the autocorrellation. We consider J = 25 and J = 250 
corresponding to 51 and 501 dimensions respectively. 
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(a) 51 Dimensional State Space, Ac- 
ceptance Rate: URMH 21%, IS 0.04%, 
RWMH 24% 

Figure 1: 



(b) 501 Dimensional State Space, Ac- 
ceptance Rate: IS 0.02%, RWMH 
0.22%, URMH 21% 

Autocorrelation 



For the RWM we choose 7V(0, C) as step distribution with 

C = D(efi,...,efi K ) 

where D corresponds to the diagonal matrix. For fair comparison we choose 
the step size e to get an acceptance rate close to 0.234 [25]. For the URMH the 
acceptance rate is not affected by the choice of J. Similarly, the autocorrelation 
of the URMH is only affected up to a point see Figure [T] 

5 Conclusion and Avenues of Further Research 

We have shown how to transfer L 2 -spectral gaps from the proposal Markov ker- 
nel to the Metropolis-Hastings Markov kernel. These yield theoretical bounds 
for a large class of proposals for non-Gaussian measures on function spaces. 
These theoretical results justify the use of sampling methods other than the in- 
dependence sampler for the Bayesian elliptic inverse problem considered above. 

However, our bounds do not show that locally moving algorithms as the 
URMH algorithm designed in Section 4 are asymptotically better than the in- 
dependence sampler. Comparing two sampling algorithms is difficult since it 
depends on the specific target as well as the functional that one desires to 
approximate. For particular functionals one might be able to construct very 
efficient importance sampling schemes. 

Assuming that the density is bounded above and below is a very restrictive 
condition. But at this level of generality weaker assumptions seem difficult to 
prove. Replacing it by the assumption that the density is bounded above and 
below on bounded sets is reasonable. Both assumptions only differ in the tails. 
Considering just a large enough set decreases, the probability of a sampling 
algorithm actually ever leaving this set in a simulation to almost zero. But 
it is actually the tail behaviour which prevents algorithms of satisfying the 
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desired convergence properties [5] . Finding the right notion of convergence and 
implications for the sample average is a direction that could be pursued. 

For the Bayesian elliptic inverse problem considered in Sections 3 and 4, 
the range of the posterior density goes to infinity as the variance of the noise 
goes to zero. This suggests that sampling methods perform worse and worse 
as the observational noise goes to zero. Getting precise asymptotics of this 
behaviour would lead to a better understanding of the performance of sampling 
algorithms for Bayesian inverse problems. Maybe it is possible to combine 
statistical methods with methods of classical inverse problems in this situation 

Last but not least the Bayesian elliptic inverse problem should be extended 
to a multi-scale setup because in subsurface geophysics there is interest in the 
fine and coarse scale properties of the permeability. Homogenisation results 
suggest that different combinations of fine and coarse scales lead to effectively 
the same homogenised problem, thus leading to a lack of identifiability. This 
seems to be a very interesting idea for further investigation. 
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